Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


As shown in Figure 5.2, the alignment statistics for one of the BAM files report the total
number of reads, the average read length, the number and percentage of uniquely mapped
reads, splice statistics, and statistics for reads mapped to multiple loci, unmapped reads,
and chimeric reads. Pay particular attention to the multi-mapped and chimeric reads; a large
proportion of either suggests low-quality alignment. Remember that this BAM file includes
the alignments for chromosome 22 only. The number of reads would be much larger if the
BAM file contained the alignments for all chromosomes.
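To compare these figures across samples, the relevant lines can be pulled out of STAR's summary log. The following is a minimal sketch, assuming each STAR run wrote its summary to a file ending in "Log.final.out" and that the metric labels match those written by STAR; the function name star_summary and the example path are hypothetical.

```shell
#!/usr/bin/env bash
# Print selected metrics from a STAR Log.final.out file as "label<TAB>value".
# star_summary is a hypothetical helper; the log path is supplied by the caller.
star_summary() {
    awk -F'|' '
        /Number of input reads|Average input read length|Uniquely mapped reads %|% of reads mapped to multiple loci|% of reads chimeric/ {
            gsub(/^[ \t]+|[ \t]+$/, "", $1)   # trim spaces around the label
            gsub(/^[ \t]+|[ \t]+$/, "", $2)   # trim spaces around the value
            print $1 "\t" $2
        }' "$1"
}

# Example usage (assumed file name and location):
# star_summary bam/norm_rep1Log.final.out
```

Each metric line in the log has the form "label | value", so splitting on the "|" character is enough to separate the two parts.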

In addition to the statistics in the STAR log files, a variety of programs are available for
assessing alignments in BAM files, including Qualimap [16], RNA-SeQC [17], and RSeQC [18].
These programs compute metrics for RNA-Seq data, including per-transcript coverage,
junction sequence distribution, genomic localization of reads, 5′–3′ bias, and consistency
of the library protocol. As an example, you can use Qualimap to obtain an overall view of
the alignment quality in HTML format. Download Qualimap from
“http://qualimap.conesalab.org/” and unzip it in your project directory. Run Qualimap for
each sample and study the reports carefully. The following script is an example of how to
use it:

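# Create the output directory for the Qualimap reports
mkdir -p qc
# Run the Qualimap RNA-Seq QC on one sample
qualimap_v2.2.1/qualimap rnaseq \
    -outdir qc \
    -a proportional \
    -bam bam/norm_rep1.bam \
    -p strand-specific-reverse \
    -gtf gtf/hg38.ncbiRefSeq.gtf \
    --java-mem-size=8G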

The above script creates the directory “qc”, where the Qualimap output files will be saved.
The program takes a BAM file and the reference annotation file as inputs and generates
an HTML report that includes summary statistics on read alignments, the genomic origin of
reads, the transcript coverage profile, and splice junction analysis, together with figures
showing the genomic origin of reads, the coverage profile along genes, the coverage
histogram, and the junction analysis. As a biologist, you may need to study these metrics
to get a general idea of the alignment quality of each sample before proceeding.
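Since Qualimap should be run for each sample, the single-sample command can be wrapped in a loop. The following is a minimal sketch, assuming all BAM files sit in one directory and that each report should go to its own subdirectory of “qc”; the function name run_qualimap_all and the variables QUALIMAP, BAM_DIR, GTF, and QC_DIR are hypothetical names introduced here for illustration.

```shell
#!/usr/bin/env bash
# Run "qualimap rnaseq" once per BAM file, one report directory per sample.
# QUALIMAP, BAM_DIR, GTF and QC_DIR are assumed locations; adjust to your project.
QUALIMAP="${QUALIMAP:-qualimap_v2.2.1/qualimap}"
BAM_DIR="${BAM_DIR:-bam}"
GTF="${GTF:-gtf/hg38.ncbiRefSeq.gtf}"
QC_DIR="${QC_DIR:-qc}"

run_qualimap_all() {
    local bam sample
    for bam in "$BAM_DIR"/*.bam; do
        [ -e "$bam" ] || continue            # skip if the glob matched nothing
        sample="$(basename "$bam" .bam)"     # e.g. norm_rep1
        mkdir -p "$QC_DIR/$sample"
        "$QUALIMAP" rnaseq \
            -outdir "$QC_DIR/$sample" \
            -a proportional \
            -bam "$bam" \
            -p strand-specific-reverse \
            -gtf "$GTF" \
            --java-mem-size=8G
    done
}

# Example usage:
# run_qualimap_all
```

Writing each report to a separate directory keeps the output of one sample from overwriting that of another.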

5.3.4 Quantification

Gene expression profiling is centered on the quantification of aligned reads per gene or
locus. Quantification begins by counting the number of reads aligned to each gene annotated
on the reference sequence. Given a BAM file with aligned RNA-Seq reads and a list of genomic
features in an annotation file (GTF format), the task of the read counting program is to
count the number of reads mapping to each feature. In general, a feature in this case is a
gene, represented by a transcript or, for eukaryotic organisms, by the union of the exons of
the gene. Some programs can also treat exons as features, which is especially useful for
studying alternative splicing in eukaryotic genes. A read in the BAM file may map to a single
feature (unique) or may map or